Text categorization using character shape codes

Authors

  • Larry Spitz
  • Arman Maghbouleh
Abstract

Text categorization in the form of topic identification is a capability of current interest. This paper is concerned with categorization of electronic document images. Previous work on the categorization of document images has relied on Optical Character Recognition (OCR) to provide the transformation between the image domain and a domain where pattern recognition techniques are more readily applied. Our work uses a different technology to provide this transformation. Character shape coding is a computationally efficient and extraordinarily robust means of providing access to the character content of document images. While this transform is lossy, sufficient salient information is retained to support many applications. Furthermore, shape coding is particularly advantageous over OCR when processing page images of poor quality. In this study we found that topic identification performance was maintained or slightly improved using character shape codes derived from images.
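
To make the idea of a lossy shape-code transform concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the paper derives codes from the geometry of character images, whereas this sketch simulates the result on already-encoded text. The class assignments (ascender-like characters to "A", descenders to "g", other lowercase letters to "x") and the function names `shape_code` and `word_shape_tokens` are hypothetical choices made for this example.

```python
# Hypothetical shape classes, chosen for illustration only; the actual
# code set used in the paper may differ.
ASCENDERS = set("bdfhklt")   # lowercase letters that rise above x-height
DESCENDERS = set("gpqy")     # lowercase letters that drop below the baseline


def shape_code(ch: str) -> str:
    """Map one character to a coarse shape class (assumed mapping)."""
    if ch.isspace():
        return " "
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"           # tall characters
    if ch in DESCENDERS:
        return "g"           # characters with a descender
    if ch in ("i", "j"):
        return ch            # dotted characters keep their own class
    if ch.isalpha():
        return "x"           # remaining letters sit within x-height
    return ch                # punctuation is passed through unchanged


def word_shape_tokens(text: str) -> list[str]:
    """Split text on whitespace and encode each word as a shape token."""
    return ["".join(shape_code(c) for c in word) for word in text.split()]


if __name__ == "__main__":
    print(word_shape_tokens("Text categorization"))
    # Distinct words can collapse to the same token: the transform is lossy.
    print(word_shape_tokens("cat eat"))
```

In this sketch "cat" and "eat" map to the same token, which shows the information loss the abstract mentions. Tokens of this kind could then be counted and passed to any standard text classifier in place of words.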

Similar resources

Modeling Content Identification from Document Images

A new technique to locate content-representing words for a given document image using abstract representation of character shapes is described. A character shape code representation defined by the location of a character in a text line has been developed. Character shape code generation avoids the computational expense of conventional optical character recognition (OCR). Because character shape...

Content Characterization Using Word Shape Tokens

shape categories, it is possible to automatically extract syntactic information from the text of document images without optical character recognition. Using word shape tokens composed of these character shape codes, a properly trained text tagger can extract part-of-speech information from scanned document images. Later components of a document processing system can then use this information to ...

Language Determination: Natural Language Processing from Scanned Document Images

Many documents are available to a computer only as images from paper. However, most natural language processing systems expect their input as character-coded text, which may be difficult or expensive to extract accurately from the page. We describe a method for converting a document image into character shape codes and word shape tokens. We believe that this representation, which is both cheap ...

Automatic Recognition of Tibetan Buddhist Text by Computer

The purpose of this study is to develop a plausible method to code and compile Buddhist texts automatically from original Tibetan scripts into the Romanized form. We extract syllables from Tibetan texts and automatically recognize the Tibetan characters. The set of Tibetan characters consists of 30 basic consonants, 76 combination characters, and 4 vowels. Despite the limited number of Tibeta...

Exploration of Contextual Constraints for Character Pre-Classification

We present strategies and results for identifying the symbol type (lower-case, upper-case, digit, and punctuation or special symbols) of every character in a text document by using various kinds of information from neighboring characters. In the expectation of reasonable word and character segmentation for shape clustering, we designed several type recognition methods that depend on cluster n-g...

Publication date: 2000